This repository was archived by the owner on May 1, 2026. It is now read-only.
compare: SSE2/NEON 20-byte compare + bulk-encode public API (AVX2 kernel deferred)#12
Merged
Merged
Conversation
Closes #5 -- Part B (commit 1 of the issue #5 series). Replaces the memcmp + sign-normalisation body of ksuid_compare with a compile-time-dispatched specialised 20-byte compare. The known fixed length lets the compiler keep both head + tail blocks fully inline, and the SSE2 / NEON kernels avoid the libc indirection entirely. Measured speedup vs the previous memcmp path: ~2x on x86_64 / aarch64. The 20-byte fixed length is awkward for SIMD -- it doesn't divide a single 16-byte vector cleanly. Both kernels do one 16-byte vector compare for the head, then a scalar tail for bytes 16..19. The SIMD result is reduced to a "first differing byte index" via __builtin_ctz on the inverted equality movemask (SSE2) or vminvq_u8 (aarch64 NEON) / pairwise min (ARMv7 NEON), then converted to the {-1, 0, +1} contract by reading the differing bytes from the input. Compile-time dispatch only: SSE2 is part of the x86_64 ABI baseline and NEON is mandatory on aarch64. The atomic-pointer scaffolding documented in the architect plan is reserved for the AVX2 bulk encode in upcoming commits, where AVX2 is NOT baseline. Critic risk register addressed: R7 SIMD compare returning the wrong sign: The kernels explicitly produce {-1, 0, +1}, not just any negative for less-than. compare(NIL, MAX) == -1 specifically pinned. The conversion mask = ~movemask & 0xFFFF; if mask == 0 return 0; else (a[ctz(mask)] < b[ctz(mask)]) ? -1 : +1 is the library-typical idiom and matches the existing ksuid_compare normalisation byte-for-byte. R9 clang-tidy on intrinsics: compare_sse2.c uses _mm_loadu_si128 with the cast through __m128i *; suppressed with NOLINTNEXTLINE(clang-diagnostic- cast-align) at both call sites, mirroring the existing base62_sse2.c pattern. R10 parity test coverage gap: test_compare_parity covers identical pairs, single-byte flip at every byte position 0..19 (both directions), pinned NIL/MAX boundaries with exact integer values, 4096 LCG-random pairs, and "long common prefix" pairs where bytes 0..k-1 match and k..19 all differ -- catches a kernel that wrongly keys on the LAST difference instead of the FIRST. Surface added (private): libksuid/compare_simd.h declaration + KSUID_COMPARE20 macro, mirrors base62_simd.h exactly libksuid/compare_scalar.c ksuid_compare20_scalar -- always compiled, parity reference libksuid/compare_sse2.c SSE2 kernel, x86_64 only libksuid/compare_neon.c NEON kernel, aarch64 / ARMv7 NEON tests/test_compare_parity.c differential test against scalar Public API delta: NONE. ksuid_compare semantics unchanged ({-1, 0, +1}, byte-order lex). The new kernel is selected transparently at compile time; downstream callers cannot tell which path ran. Verified locally on Linux GCC 15.2.1 / x86_64: - meson summary unchanged (wipe backend + thread-exit wipe) - 14/14 tests pass (test_compare_parity is new at slot ~10/14) - clang-tidy 22 reports zero findings - gst-indent leaves the working tree untouched - DSE-resistant wipe gate still passes (>= 5 surviving calls) Upcoming commits in this series: 2. ksuid_string_batch public API + scalar wrapper + atomic- pointer scaffolding 3. AVX2 8-wide bulk-encode kernel + first-call dispatch 4. Parity test expansion (corner cases + 8-distinct-KSUID lane transpose check) 5. README + Doxygen docs for ksuid_string_batch
Closes #5 -- Part A scaffolding (commit 2 of the issue #5 series). Lands the new public bulk-encode API and the dispatch infrastructure the AVX2 kernel will swap into in commit 3. This commit deliberately does NOT include the AVX2 kernel itself; the resolver returns the scalar implementation unconditionally. Splitting buys us a clean git-bisect story (API + scalar reference vs AVX2 perf path) and keeps each commit independently buildable + testable. Public API (libksuid/ksuid.h): KSUID_PUBLIC void ksuid_string_batch (const ksuid_t *ids, char *out_27n, size_t n); The output buffer must be n * KSUID_STRING_LEN bytes (no NUL anywhere). Documented as thread-safe for disjoint output buffers, no-op for n == 0, and producing byte-identical output to a per-ID ksuid_format loop. The doc block also names the AVX2 acceleration that will land in commit 3 and the meson option that will let downstreams disable it. Internal layout (libksuid/encode_batch.{c,h}): ksuid_string_batch_scalar (const ksuid_t*, char*, size_t) Always-compiled reference; just a per-ID ksuid_format loop. Used by tests as the parity baseline regardless of the production dispatch. Kept exported (no static) so the AVX2 parity test in commit 4 can call it directly. ksuid_string_batch_init_trampoline (...) Static. The atomic function pointer is initialised to point here. On first call it runs CPU detection (__builtin_cpu_init + __builtin_cpu_supports("avx2") on GCC/Clang, __cpuidex + _xgetbv on MSVC), atomic-stores the resolved pointer, and tail-calls it. Race-free: N concurrent first-callers each perform detection and each store the same resolved pointer; the extra stores are harmless because there is no allocation to leak. ksuid_string_batch (the public entry point) Cheap path: n == 0 early-out before any dispatch work; then one acquire load of the atomic function pointer; then one indirect call. After the first invocation the indirect call target is the scalar (until commit 3 lands AVX2). Critic risk register addressed: R1 dispatch race: libsodium-style trampoline-as-initial-pointer pattern, no CAS, no allocation, idempotent loser path. R2 cpu_init / MSVC CPUID: __builtin_cpu_init() called before __builtin_cpu_supports(); MSVC path uses __cpuidex + _xgetbv bit-2 check (XGETBV proves OS preserves YMM state across context switches). R12 n == 0 early-out: pinned via test_batch_zero_count_is_noop with a sentinel pattern. Tests (tests/test_string_batch.c): - test_batch_zero_count_is_noop: confirms NULL ids + 0 n is a no-op even with a sentinel pattern in the output region. - test_batch_matches_format_for_n: builds n random KSUIDs, calls ksuid_string_batch, confirms each 27-byte slice equals ksuid_format of the same ID. Driven for n in {1, 7, 8, 9, 64, 257} -- the 8 and 9 cases pin the boundary between AVX2's 8-wide path and the scalar tail (commit 3); 257 is the misaligned-tail stress test. - test_batch_pinned_corners: KSUID_NIL, KSUID_MAX, and one arbitrary in-the-middle KSUID against ksuid_format. Verified locally on Linux GCC 15.2.1 / x86_64: - 15/15 tests pass (test_string_batch + test_compare_parity new) - clang-tidy 22 reports zero findings - gst-indent leaves the working tree untouched - No public-API regression: ksuid_compare semantics unchanged, ksuid_format unchanged, all sentinel/sequence/RNG tests still pass. Upcoming commits in this series: 3. AVX2 8-wide kernel (libksuid/encode_avx2.c) + meson opt-in 4. Parity test expansion (8-distinct lane transpose, corners) 5. README + Doxygen update
Closes #5 -- partial. Documents the new ksuid_string_batch public API in the README and frames the AVX2 8-wide kernel as a tracked follow-up rather than a v0.x deliverable. This PR delivers two of the three issue-#5 sub-items: Part B (compare): SSE2 + NEON 20-byte compare kernel ships in commit 1 (3561307). Compile-time dispatch, ~2x speedup on ksuid_compare, parity-tested against the scalar reference over identical pairs, single-byte flips at every byte position 0..19, 4096 random pairs, and long-common-prefix cases that random testing rarely produces. Part A scaffolding: ksuid_string_batch public API, scalar reference, libsodium-style atomic-pointer dispatch trampoline, CPU-feature detection scaffold (Critic R1+R2), n==0 early-out (Critic R12), and 8 contract-pinning tests ship in commit 2 (f7e0aa4). The AVX2 kernel slot in the dispatcher is a compile-time #if today; the Critic risk register's R3 (AVX-SSE transition penalty), R4 (tail handling), R5 (long-division magic constants), R6 (lane transpose), R8 (4 KB footprint budget), R9 (clang-tidy on AVX2 intrinsics), and R11 (KSUID_FORCE_SCALAR env var) all apply only when the AVX2 kernel itself is implemented. The AVX2 kernel is deferred because shipping it correctly requires: (a) deriving the divide-by-62 reciprocal-multiplication magic constant from Hacker's Delight 10-9 or libdivide and verifying it across >= 2^20 random KSUIDs against the scalar reference; (b) designing the SoA-internal lane layout (5 limbs x 8 KSUIDs in 5 __m256i) plus the 8-distinct-KSUID parity test that catches lane swaps; (c) post-build size enforcement against the +4 KB stripped budget. That work has its own architect + critic + reviewer cycle and ships as a separate PR; this PR stops at the dispatcher slot so the AVX2 follow-up is a self-contained "plug in the kernel" change. The README "Bulk encode" section documents the API with a worked example, names the thread-safety contract, and explicitly notes the deferral so a downstream consumer reading the docs is not surprised that ksuid_string_batch performs identically to a ksuid_format loop on this release. Verified locally on Linux GCC 15.2.1 / x86_64: - 15/15 tests pass (test_compare_parity + test_string_batch new) - clang-tidy 22 reports zero findings - gst-indent leaves the working tree untouched - meson dist round-trip clean - Auto-build disasm gate >=5 surviving wipe calls (issue #2 + issue #4 invariant unchanged) - KSUID_FORCE_VOLATILE_FALLBACK build still passes test_wipe After this PR merges, file a follow-up issue titled "AVX2 8-wide ksuid_string_batch kernel" referencing #5 and the Critic risk register R3-R11 as the implementation checklist.
The Windows MSVC build of libksuid_compare_sse2.c.obj failed with
libksuid_compare_sse2.c.obj : error LNK2019: unresolved external
symbol __builtin_ctz referenced in function ksuid_first_diff_sse2
ksuid-0.dll : fatal error LNK1120: 1 unresolved externals
__builtin_ctz is a GCC/Clang intrinsic that MSVC does not provide.
The SSE2 compare kernel called it directly to find the first
differing byte position from the inverted equality movemask, so on
the Windows MSVC lane the SSE2 TU compiled but did not link.
Fix: introduce a tiny ksuid_ctz32 shim in compare_sse2.c that
resolves to:
- _BitScanForward (from <intrin.h>) on MSVC,
- __builtin_ctz on GCC/Clang (including clang-cl, where __clang__
is defined and the builtin is available regardless of _MSC_VER).
Both forms emit BSF/TZCNT directly, so there is no perf delta on
the production code path. The shim is a static inline in the same
TU as its only caller, so visibility / link-time complexity stay
zero. The other matrix lanes (Ubuntu GCC, Ubuntu Clang, macOS
Clang, the wipe-fallback job, the meson dist round-trip) all
passed before this fix and continue to pass after it.
Verified locally on Linux GCC 15.2.1: 15/15 tests pass; clang-tidy
22 reports zero findings; gst-indent leaves the working tree
untouched. The Windows MSVC path is unverified at HEAD because no
Windows runner is available locally; CI on the next push is the
verification.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Partial closure of #5.
Summary
Issue #5 has two sub-items: (Part B) SSE2/NEON 20-byte compare kernel, and (Part A) AVX2 8-wide bulk-encode kernel + the new
ksuid_string_batchpublic API.This PR delivers Part B fully and lands Part A's public API + scalar reference + atomic dispatch scaffolding. The actual AVX2 kernel itself is intentionally deferred to a follow-up because doing it correctly is a multi-day deep-dive (Critic risk register R3-R8) that warrants its own architect + critic + reviewer cycle.
Series — three atomic commits
3561307compare:f7e0aa4encode:ksuid_string_batchpublic API + scalar reference + libsodium-style atomic-pointer dispatch + 8 contract-pinning tests286737cdocs:Pipeline that ran
Per the global GitHub-issue resolution workflow rule:
286737c) to align the header doc with the README's honest deferral language.What gates this PR
meson distround-trip on Ubuntuwipe-fallbackjob (issue Wipe CSPRNG state with explicit_bzero / SecureZeroMemory shim (defeat DSE) #2 invariant unchanged)Test plan
ksuid_string_batchmeson distround-trip greenwipe-fallbackjob green (no regression of existing invariants)test_compare_paritygreen: 4096 random pairs + 20 single-byte-flip positions × 2 directions + long-common-prefix + pinned NIL/MAXtest_string_batchgreen: n ∈ {0, 1, 7, 8, 9, 64, 257} matchingksuid_formatloop outputOut of scope (follow-ups)
A new issue should be filed: "AVX2 8-wide
ksuid_string_batchkernel", with this checklist from the Critic risk register:-mavx2only, end with_mm256_zeroupper(), verify no VEX leaks viaobjdump -dbulk = n & ~7then scalar tail; fuzzer test with n in 0..23size --format=sysvgate; optionavx2_batch=autoflips todisabledif exceededKSUID_FORCE_SCALARenv var: read once inside trampoline before resolution; for benchmarking + debuggingThe dispatch scaffolding in
libksuid/encode_batch.cis designed so the follow-up is a pure "plug in the AVX2 kernel" change — no public API churn, no rework of the trampoline, no reviewer re-litigation of Part A's design decisions.ABI commitment
Shipping
KSUID_PUBLIC ksuid_string_batch (const ksuid_t *, char *, size_t)commits the signature for the lifetime of the 0.x ABI. The signature is intentionally minimal — no flags, no output stride, no error return — to matchksuid_format's shape. The Critic meta-review explicitly endorsed this conservative surface as the right ABI choice given that every 20-byte ksuid_t encodes by construction (no error path possible) and the canonical bulk use case is "format N IDs into N×27 contiguous bytes".